- Title
- How to mitigate the incident? An effective troubleshooting guide recommendation technique for online service systems
- Creator
- Jiang, Jiajun; Lu, Weihai; Chen, Junjie; Lin, Qingwei; Zhao, Pu; Kang, Yu; Zhang, Hongyu; Xiong, Yingfei; Gao, Feng; Xu, Zhangwei; Dang, Yingnong; Zhang, Dongmei
- Relation
- 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (online 8-13 November, 2020) p. 1410-1420
- Relation
- ARC.DP200102940 http://purl.org/au-research/grants/arc/DP200102940
- Publisher Link
- http://dx.doi.org/10.1145/3368089.3417054
- Publisher
- Association for Computing Machinery
- Resource Type
- conference paper
- Date
- 2020
- Description
- In recent years, more and more traditional shrink-wrapped software is provided as 7x24 online services. Incidents (events that lead to service disruptions or outages) could affect service availability and cause great financial loss. Therefore, mitigating incidents is important and time-critical. In practice, a document describing a mitigation process, called a troubleshooting guide (TSG), is usually used to reduce the Time To Mitigate (TTM). To investigate the usage of TSGs in real-world online services, we conduct the first empirical study on 18 real-world, large-scale online service systems at Microsoft. We analyze the distribution and characteristics of TSGs among all incident records in the past two years. According to our study, 27.2% of incidents have TSG records and 36.2% of them occurred at least twice. Besides, on average, developers spend around 36.3% of the entire mitigation time on locating the desired TSGs. Our study shows that incidents could occur repeatedly and TSGs could be reused to facilitate incident mitigation. Motivated by our empirical study, we propose an automated TSG recommendation approach, DeepRmd, which leverages the textual similarity between an incident description and its corresponding TSG using deep learning techniques. We evaluate the effectiveness of DeepRmd on 18 online service systems. The results show that DeepRmd can recommend the correct TSG as the Top 1 returned result for 80.3% of incidents, which significantly outperforms two baseline approaches.
- Subject
- incident management; incident mitigation; troubleshooting guide; online service systems
- Identifier
- http://hdl.handle.net/1959.13/1435945
- Identifier
- uon:39871
- Identifier
- ISBN:9781450370431
- Language
- eng